This project has as its final goal the creation of a statistical model capable of predicting newborn weight on the basis of clinical data collected on 2500 newborns from three hospitals.
First, a set of libraries that will be used throughout the analysis is imported and the newborns.csv dataset is loaded.
library(dplyr)
library(kableExtra)
library(moments)
library(ggplot2)
library(GGally)
library(ggcorrplot)
library(patchwork)
library(car)
library(MASS)
library(Metrics)
library(lmtest)
library(plotly)| Mother.age | Pregnancies.n | Smoker | Gestation.w | Weight | Length | Cranium | Birth.type | Hospital | Sex |
|---|---|---|---|---|---|---|---|---|---|
| 26 | 0 | 0 | 42 | 3380 | 490 | 325 | Nat | osp3 | M |
| 21 | 2 | 0 | 39 | 3150 | 490 | 345 | Nat | osp1 | F |
| 34 | 3 | 0 | 38 | 3640 | 500 | 375 | Nat | osp2 | M |
| 28 | 1 | 0 | 41 | 3690 | 515 | 365 | Nat | osp2 | M |
| 20 | 0 | 0 | 38 | 3700 | 480 | 335 | Nat | osp3 | F |
| 32 | 0 | 0 | 40 | 3200 | 495 | 340 | Nat | osp2 | F |
1. Preliminary Analysis
In the first phase, a descriptive analysis is carried out in order to explore the variables, understand their distribution and identify any anomalies.
| Mother.age | Pregnancies.n | Smoker | Gestation.w | Weight | Length | Cranium | Birth.type | Hospital | Sex | |
|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.0000 | Min. :0.0000 | Min. :25.00 | Min. : 830 | Min. :310.0 | Min. :235 | Ces: 728 | osp1:816 | F:1256 | |
| 1st Qu.:25.00 | 1st Qu.: 0.0000 | 1st Qu.:0.0000 | 1st Qu.:38.00 | 1st Qu.:2990 | 1st Qu.:480.0 | 1st Qu.:330 | Nat:1772 | osp2:849 | M:1244 | |
| Median :28.00 | Median : 1.0000 | Median :0.0000 | Median :39.00 | Median :3300 | Median :500.0 | Median :340 | NA | osp3:835 | NA | |
| Mean :28.16 | Mean : 0.9812 | Mean :0.0416 | Mean :38.98 | Mean :3284 | Mean :494.7 | Mean :340 | NA | NA | NA | |
| 3rd Qu.:32.00 | 3rd Qu.: 1.0000 | 3rd Qu.:0.0000 | 3rd Qu.:40.00 | 3rd Qu.:3620 | 3rd Qu.:510.0 | 3rd Qu.:350 | NA | NA | NA | |
| Max. :46.00 | Max. :12.0000 | Max. :1.0000 | Max. :43.00 | Max. :4930 | Max. :565.0 | Max. :390 | NA | NA | NA |
The normality of the Weight variable is checked, as it
will be considered as the response variable in the regression model.
## [1] "Skewness: -0.647"
## [1] "Kurtosis: 5.032"
The Weight variable has a negative skewness index, so
the distribution is slightly left-skewed (with a longer left tail), and
a positive kurtosis index, i.e. it has a leptokurtic
distribution, as can also be seen from the density plot.
ggplot(df) +
geom_density(aes(x = Weight), col = "black", fill = "steelblue") +
labs(title = "Density of newborn weight",
x = "Weight (g)", y = "Density")##
## Shapiro-Wilk normality test
##
## data: Weight
## W = 0.97066, p-value < 2.2e-16
Running the Shapiro–Wilk test, the null hypothesis of normality for
the Weight variable is rejected. In this case, since the
distribution of the response variable is not normal, the residuals of
the regression model may also not have a normal distribution (which is
one of the assumptions of linear regression), but this will be checked
after the model has been created.
Next, a correlation matrix is created to assess the relationships
between the continuous numeric variables, and it is visualised
graphically using the ggcorrplot package.
# Correlation matrix (only for continuous numeric variables)
cor_matrix <- cor(df %>% dplyr::select(Weight, Length, Cranium, Mother.age, Gestation.w))# Visualisation of the correlation matrix with ggcorrplot
ggcorrplot(cor_matrix, hc.order = TRUE, lab = TRUE)Among the explanatory variables there do not appear to be strong correlations that would cause multicollinearity issues for the regression model, but this aspect will be further checked later using the Variance Inflation Factor (VIF).
The relationships between the continuous variables are also
visualised by means of a scatter plot matrix (pair plot), created with
the ggpairs function from the GGally
package.
# Pair plot for the continuous numeric variables
# Function to modify the parameters of the scatter plots
lower_func <- function(data, mapping, ...) {
ggplot(data, mapping) +
geom_point(size = 0.7, alpha = 0.5, position = "jitter") +
geom_smooth(method = "lm", color = "red")
}
pp = ggpairs(df,
columns = c("Weight", "Length", "Cranium", "Mother.age", "Gestation.w"),
lower = list(continuous = lower_func))
pp[5,1] = pp[5,1] + scale_x_continuous(breaks = c(1500, 3000, 4500))
pp[5,2] = pp[5,2] + scale_x_continuous(breaks = c(350, 450, 550))
ppThe impact of the qualitative variables Sex,
Smoker (mother smoking status) and Birth.type
on the Weight variable is then explored using boxplots.
# Boxplots to explore Weight with respect to qualitative variables
bp1 = ggplot(df, aes(x = "", y = Weight)) +
geom_boxplot(fill = "steelblue") +
ggtitle("Distribution of Weight (all newborns)")
bp2 = ggplot(df, aes(x = Sex, y = Weight, fill = Sex)) +
geom_boxplot() +
ggtitle("Distribution of Weight by Sex")
bp3 = ggplot(df, aes(x = as.factor(Smoker), y = Weight, fill = as.factor(Smoker))) +
geom_boxplot() +
ggtitle("Distribution of Weight by smoking status") +
scale_x_discrete(name = "Smoker", labels = c("0" = "No", "1" = "Yes")) +
scale_fill_discrete(name = "Smoker", labels = c("No", "Yes"))
bp4 = ggplot(df, aes(x = Birth.type, y = Weight, fill = Birth.type)) +
geom_boxplot() +
ggtitle("Distribution of Weight by birth type")
bp1 + bp2 + bp3 + bp42. Hypothesis Testing
In this section the following hypotheses are now tested using the appropriate statistical tests:
- In some hospitals more caesarean sections are performed
- The means of
WeightandLengthin this sample of newborns are significantly equal to those of the population - The anthropometric measurements (
Weight,Length,Cranium) are significantly different between the two sexes
For the first point, it is checked whether the distribution of
cesarean sections varies significantly between hospitals. A contingency
table is then created between the variables Hospital and
Birth.type, and a Chi-square test is
performed, which is appropriate for testing the hypothesis of
independence between qualitative variables.
# Contingency table between Hospital and Birth.type
tab_births <- table(df$Hospital, df$Birth.type)
kbl(tab_births) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Ces | Nat | |
|---|---|---|
| osp1 | 242 | 574 |
| osp2 | 254 | 595 |
| osp3 | 232 | 603 |
##
## Pearson's Chi-squared test
##
## data: tab_births
## X-squared = 1.0972, df = 2, p-value = 0.5778
From the chi-squared test we obtain a p-value of \(0.577\). Therefore, the hypothesis of
independence between the variables Hospital and
Birth.type cannot be rejected, i.e., there is no
significant difference in the birth type between the three hospitals.
Thus, no hospital performs significantly more cesarean sections.
For the second point, the sample means of the variables
Weight and Length are compared with the
reference means of the population (Weight = 3300 g, Length = 500 mm,
source: Ospedale
Bambino Gesu) to verify that they are equal. In this case, it is
appropriate to use a one-sample t-test if the data are
normal or a Wilcoxon test if they are not.
It has already been checked, via the Shapiro–Wilk test, that the
distribution of the Weight variable is not normal, so a
Wilcoxon test is carried out.
##
## Wilcoxon signed rank test with continuity correction
##
## data: Weight
## V = 1495594, p-value = 0.9612
## alternative hypothesis: true location is not equal to 3300
The Wilcoxon test yields a p-value of \(0.961\). Therefore the null hypothesis cannot be rejected: the mean weight of the newborns in the sample can be considered not significantly different from that of the population.
A Shapiro–Wilk test is now performed to check the normality of the
Length variable.
##
## Shapiro-Wilk normality test
##
## data: Length
## W = 0.90941, p-value < 2.2e-16
Here too the Length variable does not follow a normal
distribution, so a Wilcoxon test is carried out with
null hypothesis of equality to the population mean (500 mm).
##
## Wilcoxon signed rank test with continuity correction
##
## data: Length
## V = 877236, p-value < 2.2e-16
## alternative hypothesis: true location is not equal to 500
With a p-value \(< 2.2\times10^{-16}\) the null hypothesis that the mean length of this sample of newborns is equal to that of the population is rejected. Therefore, it can be stated that the mean length calculated for this sample, equal to 494.7, is significantly lower than the population mean.
For the third point, the distributions of Weight,
Length, and Cranium between male and female
newborns are compared to verify that they are significantly different
between the two sexes. In this case, a Two-sample independent
t-test (defined based on the categorical variable
Sex) is used if the data are normal, or a Wilcoxon
test if they are not. It had already been verified, using the
Shapiro-Wilk test, that the data for the variables Weight
and Length do not have a normal distribution, so the
Wilcoxon test is performed.
##
## Wilcoxon rank sum test with continuity correction
##
## data: Weight by Sex
## W = 538641, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
With a p-value \(< 2.2\times10^{-16}\) the null hypothesis that the mean weight is the same in the two sexes is rejected: there is a significant difference in weight between males and females.
##
## Wilcoxon rank sum test with continuity correction
##
## data: Length by Sex
## W = 594455, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
Here again, with a p-value \(< 2.2\times10^{-16}\) the null hypothesis that the mean length is the same in the two sexes is rejected: the mean length differs significantly between males and females.
A Shapiro–Wilk test is now performed to check the normality of the
Cranium variable.
##
## Shapiro-Wilk normality test
##
## data: Cranium
## W = 0.96357, p-value < 2.2e-16
Since the Cranium variable also does not have a normal
distribution, a Wilcoxon test is carried out.
##
## Wilcoxon rank sum test with continuity correction
##
## data: Cranium by Sex
## W = 641638, p-value = 9.633e-15
## alternative hypothesis: true location shift is not equal to 0
With a p-value of \(9.633 \times 10^{-15}\) the null hypothesis that the mean cranium diameter is the same between the two sexes is rejected: the measurement differs significantly between males and females.
3. Building the Regression Model
In this phase, a multiple linear regression model is developed in
which Weight is the response variable and all variables in
the dataset are included as explanatory variables. In this way, it is
possible to quantify the impact of each independent variable on newborn
weight.
##
## Call:
## lm(formula = Weight ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1124.40 -181.66 -14.42 160.91 2611.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6738.4762 141.3087 -47.686 < 2e-16 ***
## Mother.age 0.8921 1.1323 0.788 0.4308
## Pregnancies.n 11.2665 4.6608 2.417 0.0157 *
## Smoker -30.1631 27.5386 -1.095 0.2735
## Gestation.w 32.5696 3.8187 8.529 < 2e-16 ***
## Length 10.2945 0.3007 34.236 < 2e-16 ***
## Cranium 10.4707 0.4260 24.578 < 2e-16 ***
## Birth.typeNat 29.5254 12.0844 2.443 0.0146 *
## Hospitalosp2 -11.2095 13.4379 -0.834 0.4043
## Hospitalosp3 28.0958 13.4957 2.082 0.0375 *
## SexM 77.5409 11.1776 6.937 5.08e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.9 on 2489 degrees of freedom
## Multiple R-squared: 0.7289, Adjusted R-squared: 0.7278
## F-statistic: 669.2 on 10 and 2489 DF, p-value: < 2.2e-16
The model summary shows the regression coefficient estimates for all
variables, which represent the marginal effects of the individual
explanatory variables on the response variable Weight.
However, although all variables (with the exception of the
Mother.age variable ) have an effect on the
Weight variable, not all variables are significant, as
indicated by the high p-values.
This suggests the possibility of developing a better model by
excluding some of the explanatory variables included in this initial
model. In particular, it is likely that eliminating the variables
Mother.age, Smoker, and Hospital
would result in a better and more parsimonious model.
The Variance Inflation Factor (VIF) is also computed to assess possible multicollinearity between the independent variables included in the model.
## GVIF Df GVIF^(1/(2*Df))
## Mother.age 1.187454 1 1.089704
## Pregnancies.n 1.186428 1 1.089233
## Smoker 1.007392 1 1.003689
## Gestation.w 1.695810 1 1.302233
## Length 2.085755 1 1.444214
## Cranium 1.630796 1 1.277026
## Birth.type 1.004242 1 1.002119
## Hospital 1.004071 2 1.001016
## Sex 1.040643 1 1.020119
Since the VIF values are below 5 for all variables, there is no evidence of problematic multicollinearity, as also suggested above by observing the correlation matrix. Therefore, it is not necessary to remove variables for this reason.
4. Selection of the Optimal Model
In this phase, the most parsimonious model is selected, i.e. the
model that balances goodness of fit and complexity, thus eliminating
non-significant variables. This procedure is performed using the
stepAIC function of the “MASS” library, which applies a
stepwise selection to minimize the model’s information criterion. In
this case, the BIC (Bayesian Information Criterion) is used as
criterion, which tends to be more parsimonious in the selection,
retaining only variables that are strictly relevant.
# Selection of the optimal model with stepwise procedure based on BIC
n = nrow(df)
stepwise.mod = stepAIC(mod1, direction = "both", k = log(n))## Start: AIC=28139.32
## Weight ~ Mother.age + Pregnancies.n + Smoker + Gestation.w +
## Length + Cranium + Birth.type + Hospital + Sex
##
## Df Sum of Sq RSS AIC
## - Mother.age 1 46578 186809099 28132
## - Smoker 1 90019 186852540 28133
## - Hospital 2 685979 187448501 28133
## - Pregnancies.n 1 438452 187200974 28137
## - Birth.type 1 447929 187210450 28138
## <none> 186762521 28139
## - Sex 1 3611021 190373542 28179
## - Gestation.w 1 5458403 192220925 28204
## - Cranium 1 45326172 232088693 28675
## - Length 1 87951062 274713583 29096
##
## Step: AIC=28132.12
## Weight ~ Pregnancies.n + Smoker + Gestation.w + Length + Cranium +
## Birth.type + Hospital + Sex
##
## Df Sum of Sq RSS AIC
## - Smoker 1 90897 186899996 28126
## - Hospital 2 692738 187501837 28126
## - Birth.type 1 448222 187257321 28130
## <none> 186809099 28132
## - Pregnancies.n 1 633756 187442855 28133
## + Mother.age 1 46578 186762521 28139
## - Sex 1 3618736 190427835 28172
## - Gestation.w 1 5412879 192221978 28196
## - Cranium 1 45588236 232397335 28670
## - Length 1 87950050 274759149 29089
##
## Step: AIC=28125.51
## Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Birth.type +
## Hospital + Sex
##
## Df Sum of Sq RSS AIC
## - Hospital 2 701680 187601677 28119
## - Birth.type 1 440684 187340680 28124
## <none> 186899996 28126
## - Pregnancies.n 1 610840 187510837 28126
## + Smoker 1 90897 186809099 28132
## + Mother.age 1 47456 186852540 28133
## - Sex 1 3602797 190502794 28165
## - Gestation.w 1 5346781 192246777 28188
## - Cranium 1 45632149 232532146 28664
## - Length 1 88355030 275255027 29086
##
## Step: AIC=28119.23
## Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Birth.type +
## Sex
##
## Df Sum of Sq RSS AIC
## - Birth.type 1 463870 188065546 28118
## <none> 187601677 28119
## - Pregnancies.n 1 651066 188252743 28120
## + Hospital 2 701680 186899996 28126
## + Smoker 1 99840 187501837 28126
## + Mother.age 1 54392 187547285 28126
## - Sex 1 3649259 191250936 28160
## - Gestation.w 1 5444109 193045786 28183
## - Cranium 1 45758101 233359778 28657
## - Length 1 88054432 275656108 29074
##
## Step: AIC=28117.58
## Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Sex
##
## Df Sum of Sq RSS AIC
## <none> 188065546 28118
## - Pregnancies.n 1 623141 188688687 28118
## + Birth.type 1 463870 187601677 28119
## + Hospital 2 724866 187340680 28124
## + Smoker 1 91892 187973654 28124
## + Mother.age 1 54816 188010731 28125
## - Sex 1 3655292 191720838 28158
## - Gestation.w 1 5464853 193530399 28181
## - Cranium 1 46108583 234174130 28658
## - Length 1 87632762 275698308 29066
##
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length +
## Cranium + Sex, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1149.44 -180.81 -15.58 163.64 2639.72
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6681.1445 135.7229 -49.226 < 2e-16 ***
## Pregnancies.n 12.4750 4.3396 2.875 0.00408 **
## Gestation.w 32.3321 3.7980 8.513 < 2e-16 ***
## Length 10.2486 0.3006 34.090 < 2e-16 ***
## Cranium 10.5402 0.4262 24.728 < 2e-16 ***
## SexM 77.9927 11.2021 6.962 4.26e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.6 on 2494 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7265
## F-statistic: 1328 on 5 and 2494 DF, p-value: < 2.2e-16
After the selection procedure, the best model according to BIC
includes the following variables: Pregnancies.n,
Gestation.w, Length, Cranium,
Sex.
## df BIC
## mod1 12 35241.84
## stepwise.mod 7 35220.10
Compared to the initial model, the selected model has a lower BIC, and is therefore preferable from the point of view of both simplicity and interpretation.
Models with interactions between some of the predictors are now
tested. In particular, a model including the interaction between
Pregnancies.n and Gestation.w is first
considered:
# Model with interaction between Pregnancies.n and Gestation.w
mod_int_1 = lm(
formula = Weight ~ Pregnancies.n * Gestation.w + Length + Cranium + Sex,
data = df
)##
## Call:
## lm(formula = Weight ~ Pregnancies.n * Gestation.w + Length +
## Cranium + Sex, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1148.72 -180.77 -14.14 163.93 2638.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6588.5062 158.7705 -41.497 < 2e-16 ***
## Pregnancies.n -66.1971 70.1095 -0.944 0.345
## Gestation.w 29.9109 4.3658 6.851 9.20e-12 ***
## Length 10.2482 0.3006 34.091 < 2e-16 ***
## Cranium 10.5463 0.4263 24.741 < 2e-16 ***
## SexM 77.9137 11.2017 6.956 4.47e-12 ***
## Pregnancies.n:Gestation.w 2.0278 1.8036 1.124 0.261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 274.6 on 2493 degrees of freedom
## Multiple R-squared: 0.7271, Adjusted R-squared: 0.7265
## F-statistic: 1107 on 6 and 2493 DF, p-value: < 2.2e-16
In this case, the interaction term
(Pregnancies.n:Gestation.w) is not significant (p-value =
0.261).
## df BIC
## stepwise.mod 7 35220.10
## mod_int_1 8 35226.66
The BIC of this model is also higher than that of the stepwise model,
so the interaction between Pregnancies.n and
Gestation.w does not improve the model.
A model with interaction between Length and
Cranium is then tested:
# Model with interaction between Sex and Gestation.w
mod_int_2 = lm(
formula = Weight ~ Pregnancies.n + Gestation.w + Length * Cranium + Sex,
data = df
)##
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length *
## Cranium + Sex, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1150.74 -180.48 -13.62 165.82 2866.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.801e+03 1.018e+03 -1.770 0.07685 .
## Pregnancies.n 1.295e+01 4.321e+00 2.996 0.00276 **
## Gestation.w 3.810e+01 3.964e+00 9.610 < 2e-16 ***
## Length -3.047e-01 2.202e+00 -0.138 0.88996
## Cranium -4.759e+00 3.191e+00 -1.491 0.13601
## SexM 7.327e+01 1.119e+01 6.545 7.20e-11 ***
## Length:Cranium 3.158e-02 6.528e-03 4.838 1.39e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.4 on 2493 degrees of freedom
## Multiple R-squared: 0.7295, Adjusted R-squared: 0.7289
## F-statistic: 1121 on 6 and 2493 DF, p-value: < 2.2e-16
The interaction term (Length:Cranium) is here
significant (p-value = \(1.39 \times
10^{-6}\))
## df BIC
## stepwise.mod 7 35220.10
## mod_int_2 8 35204.57
In this case the BIC is also lower than that of the stepwise model. An ANOVA test is also performed, which takes into account the variance explained by the two models:
## Analysis of Variance Table
##
## Model 1: Weight ~ Pregnancies.n + Gestation.w + Length * Cranium + Sex
## Model 2: Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2493 186316621
## 2 2494 188065546 -1 -1748926 23.401 1.395e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result of the ANOVA test confirms that there is a significant
increase in explained variance when the interaction is added, so
mod_int_2 can be considered a better model.
Possible non-linear effects are now tested. In particular, by
observing the prevously created scatter plots, the quadratic effect of
the Length variable is first added to the model:
##
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length +
## Cranium + Sex + I(Length^2) + Length:Cranium, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1176.69 -179.20 -11.78 165.68 1306.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.262e+03 1.003e+03 -2.256 0.024135 *
## Pregnancies.n 1.425e+01 4.254e+00 3.350 0.000821 ***
## Gestation.w 4.042e+01 3.909e+00 10.339 < 2e-16 ***
## Length -2.137e+01 3.169e+00 -6.744 1.91e-11 ***
## Cranium 2.741e+01 4.725e+00 5.800 7.46e-09 ***
## SexM 7.186e+01 1.102e+01 6.523 8.29e-11 ***
## I(Length^2) 4.476e-02 4.914e-03 9.109 < 2e-16 ***
## Length:Cranium -3.449e-02 9.689e-03 -3.560 0.000378 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 269 on 2492 degrees of freedom
## Multiple R-squared: 0.7383, Adjusted R-squared: 0.7375
## F-statistic: 1004 on 7 and 2492 DF, p-value: < 2.2e-16
## df BIC
## mod_int_2 8 35204.57
## mod_quad_1 9 35130.51
The quadratic term of the Length variable resulted to be
very significant (p-value \(<
2\times10^{-16}\) ). In addition, there is a further decrease in
BIC and also a one point increase in the adjusted R-squared
coefficient.
The quadratic effect of the Gestation.w variable is now
also added:
##
## Call:
## lm(formula = Weight ~ Pregnancies.n + Gestation.w + Length +
## Cranium + Sex + I(Length^2) + I(Gestation.w^2) + Length:Cranium,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1191.64 -181.49 -12.11 165.26 1325.43
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.400e+03 1.048e+03 -3.244 0.001193 **
## Pregnancies.n 1.451e+01 4.245e+00 3.418 0.000640 ***
## Gestation.w 2.862e+02 6.769e+01 4.228 2.44e-05 ***
## Length -3.082e+01 4.091e+00 -7.532 6.93e-14 ***
## Cranium 2.042e+01 5.090e+00 4.012 6.19e-05 ***
## SexM 7.328e+01 1.100e+01 6.664 3.28e-11 ***
## I(Length^2) 4.947e-02 5.070e-03 9.757 < 2e-16 ***
## I(Gestation.w^2) -3.227e+00 8.872e-01 -3.637 0.000281 ***
## Length:Cranium -2.046e-02 1.041e-02 -1.966 0.049358 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 268.3 on 2491 degrees of freedom
## Multiple R-squared: 0.7396, Adjusted R-squared: 0.7388
## F-statistic: 884.6 on 8 and 2491 DF, p-value: < 2.2e-16
## df BIC
## mod_quad_1 9 35130.51
## mod_quad_2 10 35125.09
The quadratic term of the Gestation.w variable also
results to be highly significant (p-value = \(0.0002\)). Moreover, there is a further
decrease in BIC and a slight increase in the adjusted
R-squared. An ANOVA test is performed to compare the models:
## Analysis of Variance Table
##
## Model 1: Weight ~ Pregnancies.n + Gestation.w + Length + Cranium + Sex +
## I(Length^2) + I(Gestation.w^2) + Length:Cranium
## Model 2: Weight ~ Pregnancies.n + Gestation.w + Length * Cranium + Sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2491 179360498
## 2 2493 186316621 -2 -6956123 48.304 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result of the ANOVA test confirms that there is a significant
gain in explained variance when the quadratic terms of
Length and Gestation.w variables are included;
mod_quad_2 is therefore considered the best model among
those fitted.
5. Analysis of Model Quality
In this phase, the predictive ability of the final model is first compared with that of the previous models, using three metrics:
BIC
adjusted R-squared
Root Mean Squared Error (RMSE)
RMSE is a metric that represents the average distance between the observed values and the values predicted by the model. A lower RMSE indicates a better ability of the model to predict data.
# List of models
models <- list(
"Base" = mod1,
"Stepwise" = stepwise.mod,
"Interaction" = mod_int_2,
"Quadratic" = mod_quad_2
)# Computation of the metrics for each model
metrics <- data.frame(
BIC = sapply(models, BIC),
Adjusted_R2 = sapply(models, function(m) summary(m)$adj.r.squared),
RMSE = sapply(models, function(m) rmse(df$Weight, predict(m)))
)| BIC | Adjusted_R2 | RMSE | |
|---|---|---|---|
| Base | 35241.84 | 0.7278038 | 273.3222 |
| Stepwise | 35220.10 | 0.7264542 | 274.2740 |
| Interaction | 35204.57 | 0.7288893 | 272.9957 |
| Quadratic | 35125.09 | 0.7388017 | 267.8511 |
From the summary table, all the metrics (lower BIC, higher adjusted R-squared, lower RMSE) are better for the final model, which includes the quadratic terms, confirming that it is the most appropriate among those considered.
A residual analysis of the model is now carried out.
From the diagnostic plots, residuals are largely normally distributed, but there are some values, particularly in the upper tail, showing some deviations from normality. Furthermore, the residuals do not appear to be evenly distributed around the mean of 0, i.e. they show a slight heteroscedasticity.
Lastly, looking at the last of the four standard diagnostic plots, which shows potential influential observations on regression estimates (Cook’s distance), it can be seen that one of the observations exceeds both problematic thresholds of 0.5 and 1.
Residual tests are now performed.
##
## Shapiro-Wilk normality test
##
## data: residuals(mod_quad_2)
## W = 0.99039, p-value = 6.954e-12
ggplot() +
geom_density(aes(x = residuals(mod_quad_2)), col = "black", fill = "steelblue") +
labs(title = "Density of residuals",
x = "Residuals", y = "Density")The Shapiro-Wilk test confirms that the residuals do not appear to be perfectly normal. As can be seen from the density graph, this is mainly due to some extreme values in the two tails.
##
## studentized Breusch-Pagan test
##
## data: mod_quad_2
## BP = 134.53, df = 8, p-value < 2.2e-16
The Breusch-Pagan test for homoscedasticity also indicates that the variance of the residuals is not constant, as previously suggested by the diagnostic plots.
##
## Durbin-Watson test
##
## data: mod_quad_2
## DW = 1.9471, p-value = 0.09295
## alternative hypothesis: true autocorrelation is greater than 0
From the Durbin–Watson test, the null hypothesis of no autocorrelation is not rejected, so the residuals do not show evidence of autocorrelation.
To identify precisely any influential values, leverage values (hat values) are examined, which correspond to observations located far out in the space of the explanatory variables.
# Leverage values
hat = hatvalues(mod_quad_2)
p = sum(hat)
threshold = 2 * p/n
plot(hat, pch = 20, main = "Leverage", ylab = "Hat values")
abline(h = threshold, col = 2)From the plot there are several leverage values that exceed the threshold (2·p/n), one of which by a considerable margin.
Extreme values of the response variable (outliers) are also identified through the analysis of studentized residuals.
# Studentized residuals
plot(rstudent(mod_quad_2), pch = 20,
main = "Studentized residuals", ylab = "Studentized residuals")## rstudent unadjusted p-value Bonferroni p
## 1551 6.249419 4.8311e-10 1.2078e-06
## 1306 4.968338 7.2110e-07 1.8028e-03
## 1399 -4.463879 8.4073e-06 2.1018e-02
## 155 4.387990 1.1917e-05 2.9792e-02
## 1694 4.290034 1.8546e-05 4.6366e-02
By performing a test on studentized residuals, it appears that in this case 5 values are classified as outliers, and therefore potentially influential values.
The combined effect of leverage and outliers is now considered by computing and visualising Cook’s distances:
cook = cooks.distance(mod_quad_2)
plot(cook, pch = 20, main = "Cook's distance")
abline(h = 0.5, col = 2)## 1551
## 5.311502
As already noted, only one value exceeds the Cook’s distance threshold of 0.5 and can therefore be considered influential. However, assuming that the observation is not a measurement error, the model can be considered sufficiently adequate.
6. Predictions and Results
In this phase, the model is used to make predictions, estimating the
weight of a newborn by considering different combinations of values for
the explanatory variables. To simplify the process, a function is
created that takes the values of the explanatory variables of the model
as arguments (with the median values of the dataset set as default
values) and returns the prediction of the Weight
variable.
# Function to predict newborn weight
weight_prediction = function(Pregnancies.n = 1,
Gestation.w = 39,
Length = 500,
Cranium = 340,
Sex = "M") {
new_data <- data.frame(Pregnancies.n = c(Pregnancies.n),
Gestation.w = c(Gestation.w),
Length = c(Length),
Cranium = c(Cranium),
Sex = as.factor(c(Sex)))
predicted_weight = predict(mod_quad_2, newdata = new_data)
return(cat("Based on the following values:\n",
"\nNumber of pregnancies:", Pregnancies.n,
"\nWeeks of gestation:", Gestation.w,
"\nLength of the newborn (mm):", Length,
"\nCranium diameter of the newborn (mm):", Cranium,
"\nSex of the newborn:", Sex,
"\n\nThe estimated weight of the newborn (g) is:", predicted_weight))
}Without specifying any arguments, the function returns the prediction with the default values:
## Based on the following values:
##
## Number of pregnancies: 1
## Weeks of gestation: 39
## Length of the newborn (mm): 500
## Cranium diameter of the newborn (mm): 340
## Sex of the newborn: M
##
## The estimated weight of the newborn (g) is: 3364.558
Setting the Sex variable to "F", the
function returns the prediction for a female newborn:
## Based on the following values:
##
## Number of pregnancies: 1
## Weeks of gestation: 39
## Length of the newborn (mm): 500
## Cranium diameter of the newborn (mm): 340
## Sex of the newborn: F
##
## The estimated weight of the newborn (g) is: 3291.283
It is possible to insert different combinations of arguments
(e.g. Pregnancies.n, Gestation.w,
Sex) to modify the values of the variables and return the
corresponding prediction:
## Based on the following values:
##
## Number of pregnancies: 3
## Weeks of gestation: 39
## Length of the newborn (mm): 500
## Cranium diameter of the newborn (mm): 340
## Sex of the newborn: M
##
## The estimated weight of the newborn (g) is: 3393.577
## Based on the following values:
##
## Number of pregnancies: 3
## Weeks of gestation: 39
## Length of the newborn (mm): 500
## Cranium diameter of the newborn (mm): 340
## Sex of the newborn: F
##
## The estimated weight of the newborn (g) is: 3320.302
## Based on the following values:
##
## Number of pregnancies: 1
## Weeks of gestation: 28
## Length of the newborn (mm): 500
## Cranium diameter of the newborn (mm): 340
## Sex of the newborn: M
##
## The estimated weight of the newborn (g) is: 2594.537
## Based on the following values:
##
## Number of pregnancies: 1
## Weeks of gestation: 28
## Length of the newborn (mm): 500
## Cranium diameter of the newborn (mm): 340
## Sex of the newborn: F
##
## The estimated weight of the newborn (g) is: 2521.262
By entering the values of all the variables included in the model, the predicted weight is estimated with the highest precision allowed by the model:
## Based on the following values:
##
## Number of pregnancies: 2
## Weeks of gestation: 33
## Length of the newborn (mm): 512
## Cranium diameter of the newborn (mm): 345
## Sex of the newborn: F
##
## The estimated weight of the newborn (g) is: 3179.727
7. Visualizations
Finally, several plots are created to visualise some of the most important relationships between the variables included in the model.
For example, the relationship between the response variable
Weight and the explanatory variable
Gestation.w is shown by a scatter plot with colours
representing Sex (also showing the corresponding regression
lines) and different point sizes for Length:
ggplot(data = df) +
geom_point(aes(x = Gestation.w, y = Weight, colour = Sex, size = Length),
alpha = 0.2,
position = "jitter") +
geom_smooth(aes(x = Gestation.w, y = Weight, colour = Sex),
se = FALSE, method = "lm") +
labs(title = "Impact of Gestation, Sex and Length on Weight")Then the relationship between Weight and
Length is visualized, using point size to represent the
Gestation.w variable:
ggplot(data = df) +
geom_point(aes(x = Length, y = Weight, colour = Sex, size = Gestation.w),
alpha = 0.2,
position = "jitter") +
geom_smooth(aes(x = Length, y = Weight, colour = Sex),
se = FALSE, method = "lm") +
labs(title = "Impact of Length, Sex and Gestation on Weight")Using the plotly library, a three-dimensional scatter
plot is created to display the joint effect of Gestation.w
and Length on the response variable Weight,
keeping two different colours for Sex:
fig <- plot_ly(data = df,
x = ~Gestation.w, y = ~Length, z = ~Weight,
color = ~Sex,
opacity = 0.5,
marker = list(size = 5)) %>%
layout(title = "Impact of Gestation, Length and Sex on Weight",
legend = list(title = list(text = "Sex")))
figMorever, in this way the point size can also be used to include a
fifth variable, for example Pregnancies.n:
fig <- plot_ly(data = df,
x = ~Gestation.w, y = ~Length, z = ~Weight,
color = ~Sex, size = ~Pregnancies.n, sizes = c(50, 500),
text = ~paste('Number of pregnancies:', Pregnancies.n)) %>%
layout(title = "Impact of Gestation, Length, Sex and Pregnancies num. on Weight",
legend = list(title = list(text = "Sex")))
fig8. Conclusions
This inferential analysis on clinical data from 2500 newborns allowed to identify the main factors that influence birth weight and to build a regression model capable of predicting newborn weight from measurable clinical variables. Such a model may be useful to rapidly detect possible anomalies and plan in advance the use of intensive care.
The analysis has also allowed several preliminary questions to be answered. Specifically, it was found that caesarean sections do not appear to be significantly more frequent in one hospital than in the others. Moreover, the mean of weight in the three hospitals considered is not significantly different from that of the reference population, whereas length is significantly lower. Lastly, the hypothesis that the anthropometric measurements (weight, length and cranium diameter) are significantly different between the two sexes has been confirmed.
Some interesting conclusions emerged from the multiple regression analysis. For example, it was found that factors such as maternal smoking and mother’s age do not significantly affect birth weight. Instead, duration of gestation and length have a strong effect on weight, and their relationship with weight is not linear, as shown by the improvement obtained by including quadratic terms. However, further work including other samples obtained from a higher number of hospitals or additional variables could lead to different results and more generalizable conclusions.